Convergence of Indirect Adaptive Asynchronous Value Iteration Algorithms

Authors

Vijaykumar Gullapalli

Andrew G. Barto

Abstract

Reinforcement learning methods based on approximating dynamic programming (DP) are receiving increased attention due to their utility in forming reactive control policies for systems embedded in dynamic environments. Environments are usually modeled as controlled Markov processes, but when the environment model is not known a priori, adaptive methods are necessary. Adaptive control methods are often classified as being direct or indirect. Direct methods adapt the control policy directly from experience, whereas indirect methods adapt a model of the controlled process and compute control policies based on the latest model. In this paper, we focus on indirect adaptive DP-based methods. We present a convergence result for indirect adaptive asynchronous value iteration algorithms for the case in which a look-up table is used to store the value function. Our result implies convergence of several existing reinforcement learning algorithms such as adaptive real-time dynamic programming (ARTDP) (Barto, Bradtke, & Singh, 1993) and prioritized sweeping (Moore & Atkeson, 1993). Although the emphasis of researchers studying DP-based reinforcement learning has been on direct adaptive methods such as Q-Learning (Watkins, 1989) and methods using TD algorithms (Sutton, 1988), it is not clear that these direct methods are preferable in practice to indirect methods such as those analyzed in this paper.
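The indirect scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm or their proof setting, only one plausible instance under stated assumptions: transition probabilities are estimated by maximum-likelihood counts, immediate rewards are assumed known, unvisited state-action pairs fall back to a uniform model, exploration is epsilon-greedy, and only the current state's table entry is backed up at each step (the asynchronous part). The names `indirect_adaptive_avi`, `demo_step`, and `demo_reward` are hypothetical.

```python
import random


def indirect_adaptive_avi(n_states, n_actions, step, reward, gamma=0.9,
                          n_steps=5000, epsilon=0.2, seed=0):
    """Sketch of indirect adaptive asynchronous value iteration.

    step(s, a, rng) -> next state sampled from the unknown environment.
    reward(s, a)    -> immediate reward, assumed known (an assumption).
    """
    rng = random.Random(seed)
    # Model: transition counts, turned into maximum-likelihood estimates.
    counts = [[[0] * n_states for _ in range(n_actions)]
              for _ in range(n_states)]
    visits = [[0] * n_actions for _ in range(n_states)]
    V = [0.0] * n_states  # look-up table for the value function

    def q(s, a):
        # One-step lookahead under the *current* estimated model.
        n = visits[s][a]
        if n == 0:
            # Assumption: uniform model until the pair has been tried.
            expected_next = sum(V) / n_states
        else:
            expected_next = sum(counts[s][a][t] * V[t]
                                for t in range(n_states)) / n
        return reward(s, a) + gamma * expected_next

    s = 0
    for _ in range(n_steps):
        # Asynchronous backup: update only the current state's entry.
        V[s] = max(q(s, a) for a in range(n_actions))
        # Epsilon-greedy action selection keeps every pair being tried.
        if rng.random() < epsilon:
            a = rng.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda b: q(s, b))
        t = step(s, a, rng)
        # Adapt the model from the observed transition.
        counts[s][a][t] += 1
        visits[s][a] += 1
        s = t
    return V, counts


def demo_step(s, a, rng):
    # Toy deterministic environment: action a always leads to state a.
    return a


def demo_reward(s, a):
    # Reward 1 for every step spent in state 1, otherwise 0.
    return 1.0 if s == 1 else 0.0


if __name__ == "__main__":
    V, _ = indirect_adaptive_avi(2, 2, demo_step, demo_reward)
    print(V)
```

In the toy environment the optimal policy always chooses action 1, so with a discount factor of 0.9 the optimal values are 10 for state 1 and 9 for state 0, and the table entries should approach these as both states keep being visited. A direct method such as Q-learning would instead update action values from each sampled transition without keeping the `counts` model.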

Related resources

A Family of Variable Step-Size Normalized Subband Adaptive Filter Algorithms Using Statistics of System Impulse Response

This paper presents a new variable step-size normalized subband adaptive filter (VSS-NSAF) algorithm. The proposed algorithm uses the prior knowledge of the system impulse response statistics and the optimal step-size vector is obtained by minimizing the mean-square deviation (MSD). In comparison with NSAF, the VSS-NSAF algorithm has faster convergence speed and lower MSD. To reduce the computa...

APPENDIX A: Notation and Mathematical Conventions

ion, 31; Affine monotonic model, 19, 162, 189, 195; Aggregation, 20; Aggregation, distributed, 23; Aggregation, multistep, 27; Aggregation equation, 25; Aggregation probability, 21; Approximate DP, 24; Approximation models, 24, 49; Asynchronous algorithms, 30, 71, 90, 186, 196, 235; Asynchronous convergence theorem, 74, 92; Asynchronous policy iteration, 23, 77, ...

Empirical Q-Value Iteration

We propose a new simple and natural algorithm for learning the optimal Q-value function of a discounted-cost Markov Decision Process (MDP) when the transition kernels are unknown. Unlike the classical learning algorithms for MDPs, such as Q-learning and ‘actor-critic’ algorithms, this algorithm doesn’t depend on a stochastic approximation-based method. We show that our algorithm, which we call ...

Analysis of Some Incremental Variants of Policy Iteration: First Steps Toward Understanding Actor-Critic Learning Systems

This paper studies algorithms based on an incremental dynamic programming abstraction of one of the key issues in understanding the behavior of actor-critic learning systems. The prime example of such a learning system is the ASE/ACE architecture introduced by Barto, Sutton, and Anderson (1983). Also related are Witten's adaptive controller (1977) and Holland's bucket brigade algorithm (1986). ...

Q-learning and policy iteration algorithms for stochastic shortest path problems

We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in Bertsekas and Yu (Math. Oper. Res. 37(1...

Publication date: 1993
